(NIPS 2016) Learning what and where to draw

Keyword [Location-Controllable]

Reed S E, Akata Z, Mohan S, et al. Learning what and where to draw[C]//Advances in Neural Information Processing Systems. 2016: 217-225.



1. Overview


1.1. Motivation

  • existing methods synthesize images from global constraints only (a class label or caption) and provide no control over pose or object location

This paper proposes the Generative Adversarial What-Where Network (GAWWN):

  • synthesizes images given instructions describing what content to draw and in which location
    • condition on a coarse location (bounding box), implemented with a spatial transformer network (STN)
    • condition on part locations, given as a set of normalized (x, y) keypoint coordinates


1.2. Contribution

  • a novel architecture for text- and location-controllable image synthesis
  • a text-conditional object part completion model, enabling a streamlined user interface for specifying part locations

1.3. Related Work

  • CNN (deterministic)
  • VAE, convolutional VAE, recurrent VAE (probabilistic)
  • GAN
  • STN (spatial transformer network)

1.4. Future Work

  • learn the object and part locations in an unsupervised or weakly supervised way



2. GAWWN


2.1. Bounding-Box-Conditional Text-to-Image Model



  • replicate the text embedding spatially to form an M×M×T feature map
  • warp it spatially into the normalized bounding-box coordinates (regions outside the box are all zeros); see the sketch below
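
A minimal PyTorch sketch of these two steps, assuming (B, T) text embeddings and normalized (x, y, w, h) boxes. Because the replicated map is constant over space, warping it into the box amounts to zeroing everything outside the box, so the paper's STN warp is approximated here by a simple binary mask; the function name and shapes are illustrative assumptions, not the paper's code.

```python
import torch


def spatial_text_map(text_emb, bbox, M=16):
    """Replicate a text embedding over an M x M grid and zero it outside
    a normalized bounding box.

    text_emb: (B, T) text embedding
    bbox:     (B, 4) normalized (x, y, w, h) in [0, 1]
    returns:  (B, T, M, M) feature map that is zero outside the box
    """
    B, T = text_emb.shape
    # replicate spatially: (B, T) -> (B, T, M, M)
    feat = text_emb.view(B, T, 1, 1).expand(B, T, M, M)

    # grid of normalized coordinates for each cell
    ys = torch.linspace(0.0, 1.0, M).view(1, M, 1).expand(B, M, M)
    xs = torch.linspace(0.0, 1.0, M).view(1, 1, M).expand(B, M, M)

    # binary mask: 1 inside the box, 0 outside
    x0 = bbox[:, 0].view(B, 1, 1)
    y0 = bbox[:, 1].view(B, 1, 1)
    w = bbox[:, 2].view(B, 1, 1)
    h = bbox[:, 3].view(B, 1, 1)
    mask = ((xs >= x0) & (xs <= x0 + w) & (ys >= y0) & (ys <= y0 + h)).float()

    return feat * mask.unsqueeze(1)  # (B, T, M, M)
```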

2.2. Keypoint-Conditional Text-to-Image Model



  • keypoint locations are encoded into an M×M×K spatial feature map (each channel corresponds to one part)
  • the map is max-pooled over the K part channels to obtain a binary spatial mask, which is then replicated depth-wise to gate the other feature maps; see the sketch below
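
A small sketch of this encoding under the same assumptions (PyTorch, normalized keypoints, a hypothetical keypoint_maps helper): build the M×M×K part map, take the max over the part channels, and replicate the resulting mask depth-wise.

```python
import torch


def keypoint_maps(keypoints, visible, M=16, depth=128):
    """Encode part keypoints as an M x M x K spatial map, then max over the
    part channels and replicate the result depth-wise.

    keypoints: (B, K, 2) normalized (x, y) part locations in [0, 1]
    visible:   (B, K) float flags, 1 if the part is annotated
    returns:   part_map (B, K, M, M), gate (B, depth, M, M)
    """
    B, K, _ = keypoints.shape
    part_map = torch.zeros(B, K, M, M)

    # place a 1 at the grid cell of every visible keypoint
    cols = (keypoints[..., 0] * (M - 1)).round().long().clamp(0, M - 1)  # x -> column
    rows = (keypoints[..., 1] * (M - 1)).round().long().clamp(0, M - 1)  # y -> row
    for b in range(B):
        for k in range(K):
            if visible[b, k] > 0:
                part_map[b, k, rows[b, k], cols[b, k]] = 1.0

    # max over the K part channels -> binary "any part here" mask,
    # replicated depth-wise so it can gate other feature tensors
    mask = part_map.max(dim=1, keepdim=True).values   # (B, 1, M, M)
    gate = mask.expand(B, depth, M, M)                # (B, depth, M, M)
    return part_map, gate
```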

2.3. Conditional Keypoint Generation Model

It is not practical to require users to enter every single keypoint of the object parts they wish to be drawn. It would therefore be useful to have access to all of the conditional distributions of unobserved keypoints given a subset of observed keypoints and the text description.
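
A hedged sketch of what such a completion model could look like: a small MLP generator that keeps the observed keypoints via a binary switch vector and samples the rest conditioned on the text and noise. The class name, layer sizes, and the MLP form are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn


class KeypointCompletion(nn.Module):
    """Text-conditional keypoint completion: observed keypoints are passed
    through unchanged, unobserved ones are filled in conditioned on the text,
    the observed subset, and a noise vector."""

    def __init__(self, K=15, text_dim=128, noise_dim=32, hidden=256):
        super().__init__()
        self.K = K
        in_dim = K * 2 + K + text_dim + noise_dim  # masked keypoints + switch + text + z
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, K * 2), nn.Sigmoid(),  # normalized (x, y) in [0, 1]
        )

    def forward(self, obs_kpts, switch, text_emb, z):
        # obs_kpts: (B, K, 2), switch: (B, K) float with 1 = observed,
        # text_emb: (B, text_dim), z: (B, noise_dim)
        B = obs_kpts.size(0)
        masked = obs_kpts * switch.unsqueeze(-1)  # hide unobserved coordinates
        inp = torch.cat([masked.view(B, -1), switch, text_emb, z], dim=1)
        gen = self.net(inp).view(B, self.K, 2)
        # keep observed keypoints, use generated coordinates for the rest
        s = switch.unsqueeze(-1)
        return s * obs_kpts + (1 - s) * gen
```

Gating the output with the switch means user-provided keypoints are never altered; the generator only has to propose the missing parts.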



3. Experiments


3.1. Details

  • text embedding: char-CNN-GRU
  • Adam optimizer with learning rate 0.0002, batch size 16; see the sketch below
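
A minimal sketch of this optimizer setup in PyTorch; the generator/discriminator placeholders and the beta values are assumptions, only the learning rate and batch size come from these notes.

```python
import torch
import torch.nn as nn

# stand-in modules for the GAWWN generator / discriminator
generator = nn.Linear(100, 64 * 64 * 3)
discriminator = nn.Linear(64 * 64 * 3, 1)

# reported setup: Adam with learning rate 0.0002, batch size 16
# (beta1 = 0.5 is a common GAN choice and an assumption here)
batch_size = 16
g_optim = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
d_optim = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
```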

3.2. With Bounding Box



3.3. Via Keypoints




3.4. Comparison